Abstract: To address the limitations of 3D convolutional neural networks and two-stream convolutional neural networks for human activity recognition in video, a composite deep neural network combining the two-stream and 3D convolutional architectures is proposed. An improved residual (2+1)D convolutional neural network is employed in both the temporal and spatial sub-networks of the two-stream architecture, so that action representations and classifiers are learned from the RGB frames and the optical flow of the video, respectively, and the classification results of the temporal and spatial streams are then fused. Furthermore, during network training, stochastic gradient descent with momentum is improved by the gradient centralization algorithm to enhance generalization without changing the network structure. Experimental results show that the proposed network achieves higher accuracy on the UCF101 and HMDB51 datasets.
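To make the architectural choice concrete, the sketch below shows the (2+1)D factorization on which the residual (2+1)D network is built: a full 3D convolution is replaced by a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. This is a minimal PyTorch illustration of the general technique, not the paper's exact block; the class name `Conv2Plus1D` and all channel sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Module):
    """Minimal sketch of a (2+1)D convolution: a spatial 1 x d x d
    convolution followed by a temporal t x 1 x 1 convolution, with an
    intermediate width `mid_channels` chosen so the parameter count
    roughly matches the full 3D kernel it replaces (an assumption here)."""

    def __init__(self, in_channels, out_channels, mid_channels,
                 spatial_kernel=3, temporal_kernel=3, stride=(1, 1, 1)):
        super().__init__()
        self.spatial = nn.Conv3d(
            in_channels, mid_channels,
            kernel_size=(1, spatial_kernel, spatial_kernel),
            stride=(1, stride[1], stride[2]),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2),
            bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(
            mid_channels, out_channels,
            kernel_size=(temporal_kernel, 1, 1),
            stride=(stride[0], 1, 1),
            padding=(temporal_kernel // 2, 0, 0),
            bias=False)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))


# Illustrative usage on a clip of 8 RGB frames at 112 x 112 resolution.
x = torch.randn(2, 3, 8, 112, 112)
block = Conv2Plus1D(in_channels=3, out_channels=64, mid_channels=45)
print(block(x).shape)  # torch.Size([2, 64, 8, 112, 112])
```

The extra nonlinearity between the spatial and temporal convolutions is one reason the (2+1)D decomposition can outperform a single 3D kernel of the same nominal size.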
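The training improvement mentioned in the abstract, gradient centralization applied to SGD with momentum, can also be sketched briefly: each multi-dimensional weight gradient is zero-meaned over all dimensions except the output-channel axis before the usual momentum update. The subclass below is a minimal, hedged illustration of that idea in PyTorch; the class name `GCSGD` and the hyperparameters in the usage comment are assumptions, not the paper's exact settings.

```python
import torch
from torch.optim import SGD


class GCSGD(SGD):
    """SGD with momentum where each weight gradient is centralized
    (zero-meaned over all dimensions except the output-channel axis)
    before the standard update, in the spirit of gradient centralization."""

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Centralize only multi-dimensional weights (conv / fc),
                # leaving biases and BatchNorm parameters untouched.
                if p.grad.dim() > 1:
                    dims = tuple(range(1, p.grad.dim()))
                    p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))
        return super().step(closure)


# Illustrative usage; learning rate, momentum and weight decay are assumptions.
# optimizer = GCSGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```

Because the centralization is applied to the gradients only, it can be dropped into an existing training loop without modifying the network itself, which matches the abstract's claim of improving generalization without varying the network structure.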